Pascal Martin
10 octobre 2018
Illumina sequencing: bridge PCR (illustrations from atdbio)
Illustrations from NMBU
… and decreased prices
library(Biostrings)Containers:
“Masked” sequences are also supported (see ?masks)
Manipulation:
dm3_upstream_filepath <- system.file("extdata","dm3_upstream2000.fa.gz",
package="Biostrings")
dm3_upstream <- readDNAStringSet(dm3_upstream_filepath)
dm3_upstream## A DNAStringSet instance of length 26454
## width seq names
## [1] 2000 GTTGGTGGCCCACCAGTGC...GTTTACCGGTTGCACGGT NM_078863_up_2000...
## [2] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201794_up_2...
## [3] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201795_up_2...
## [4] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201796_up_2...
## [5] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201797_up_2...
## ... ... ...
## [26450] 2000 ATTTACAAGACTAATAAAG...ATTAAATTTCAATAAAAC NM_001111010_up_2...
## [26451] 2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001015258_up_2...
## [26452] 2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001110997_up_2...
## [26453] 2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001276245_up_2...
## [26454] 2000 CGTATGTATTAGTTAACTC...AAGTGTAAGAACAAATTG NM_001015497_up_2...
dm3_upstream[[5]]## 2000-letter "DNAString" instance
## seq: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAG...ATTAATCGATAGATACGGAAAGTCATCCTCGAT
Like LETTERS in base R, the Biostrings package provides a DNA_ALPHABET.
DNAString object containing a random sequence of length 50.Note that masks can also be associated to Biostrings and BSgenome objects
The Rsamtools package provides function to work with large fasta file(s).
The FaFile function creates a reference to an indexed fasta file (see ?FaFile).
This is particularly useful to extract sequences within a fasta file:
library(Rsamtools)
fl <- system.file("extdata", "ce2dict1.fa", package="Rsamtools",
mustWork=TRUE)
fa <- open(FaFile(fl))
seqinfo(fa)## Seqinfo object with 5 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## pattern01 18 NA <NA>
## pattern02 25 NA <NA>
## pattern03 24 NA <NA>
## pattern04 24 NA <NA>
## pattern05 25 NA <NA>
getSeq(fa, GRanges("pattern05:11-20"))## A DNAStringSet instance of length 1
## width seq names
## [1] 10 TTTGGTGGTA pattern05
library(BSgenome.Dmelanogaster.UCSC.dm3)names(Dmelanogaster)[1:5]## [1] "chr2L" "chr2R" "chr3L" "chr3R" "chr4"
Dmelanogaster$chr2L## 23011544-letter "DNAString" instance
## seq: CGACAATGCACGACAGAGGAAGCAGAACAGATAT...TATTTGCAAATTTTGATGAACCCCCCTTTCAAA
For a masked version of the genome, see:
library(BSgenome.Dmelanogaster.UCSC.dm3.masked)For adding SNPs info see:
library(BSgenome)
?available.SNPsmatchPattern("KATCGATA", dm3_upstream[[592]], fixed=FALSE)## Views on a 2000-letter DNAString subject
## subject: TCCCAAATTAACTAATGGCTTTTCACGCAGAT...GCCTCACTTTTGTCCACATCTTTTGAAAGGC
## views:
## start end width
## [1] 72 79 8 [GATCGATA]
## [2] 512 519 8 [TATCGATA]
## [3] 518 525 8 [TATCGATA]
K is G or T, see IUPAC code
Other functions to search for patterns: matchProbePair, findPalindromes, …
vmatchPattern('TATCGATA', Dmelanogaster)Probabilistic description of short sequences largely used for TF binding sites
Get a motif (as a PFM):
EcRMotif <- MotifDb::query(MotifDb,"EcR")[1]seqLogo representation:
Convert to PWM:
EcRpfm <- apply(reverseComplement(EcRMotif[[1]]) *
as.integer(mcols(EcRMotif)$sequenceCount),
2, as.integer)
rownames(EcRpfm) <- rownames(EcRMotif[[1]])
EcRpwm <- PWM(EcRpfm)| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
|---|---|---|---|---|---|---|---|---|
| A | 0.11 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 0.13 | 0.10 |
| C | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 0.13 | 0.00 | 0.00 |
| G | 0.11 | 0.07 | 0.13 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 |
| T | 0.00 | 0.00 | 0.00 | 0.00 | 0.13 | 0.00 | 0.00 | 0.12 |
Search the motif in chr4 (‘+’ strand only):
EcRHits <- matchPWM(EcRpwm, Dmelanogaster$chr4)
length(EcRHits)## [1] 2029
EcRHits[1:2]## Views on a 1351857-letter DNAString subject
## subject: GAATTCGCGTCCGCTTACCCATGTGCCTGTGG...TAAAAGCAGCCGTCGATTTGAGATATATGAA
## views:
## start end width
## [1] 255 262 8 [AGGGTGAT]
## [2] 308 315 8 [AAGGGCAT]
For minus strand: use the reverseComplement of PWM
A typical pipeline:
Find a set of ChIP-seq peaks for a TF as a bed file (eg. ENCODE or modENCODE.
Use a de novo motif search to identify enriched motifs (e.g. RSAT, MEME, BaMM motif, R packages rGADEM or BCRANK)
Get a PFM or PWM from the results. Convert as PWM if necessary.
Scan the genome with the PWM (background frequencies!)
Annotate the identified motif location (TSS? enhancers? near specific groups of genes? etc.)
…
Other packages to work with motifs:
- MotifDb
- seqLogo
- TFBSTools
- rGADEM
- PWMEnrich
- BCRANK
- MotIV
- …
For database queries (+ other tools for AA sequences):
seqinr
pairwiseAlignment is the core function:
pairwiseAlignment('CTTGCAGTGGTGTATTCATAC',
dm3_upstream[[1]],
type='global-local')## Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: [1] CTTGCAGTGG-TGTATTCATAC
## subject: [713] CTTGCAGTGGGTGTAT--ATAC
## score: 5.653368
Levensthein edit distance:
stringDist(c("lazy", "HaZy", "crAzY"))## 1 2
## 2 2
## 3 4 5
stringDist(c("lazy", "HaZy", "crAzY"), ignoreCase = TRUE)## 1 2
## 2 1
## 3 2 2
Consider a sequencing run with 10 multiplexed samples. We have the following 16 indexes available.
indx <- DNAStringSet( c("ATCACG", "CGATGT", "TTAGGC", "TGACCA",
"ACAGTG", "GCCAAT", "CAGATC", "ACTTGA",
"GATCAG", "TAGCTT", "GGCTAC", "CTTGTA",
"CGGCTA", "TCCGCG", "ATGTCA", "AGCGAT"))stringDist function to compare these indexes and choose the 10 “best” indexesconsensusMatrix function)\(Q = -10*{\log_{10}(P)}\) <=> \(P = 10^{-\frac{Q}{10}}\)
The ShortRead package (M. Morgan et al. 2009)
Import a fastq file with 20K reads:
fq1_path <- system.file(package="ShortRead","extdata","E-MTAB-1147",
"ERR127302_1_subset.fastq.gz")
myFastq <- readFastq(fq1_path)Explore with:
myFastq## class: ShortReadQ
## length: 20000 reads; width: 72 cycles
myFastq[1:3]## class: ShortReadQ
## length: 3 reads; width: 72 cycles
head(sread(myFastq), 2)## A DNAStringSet instance of length 2
## width seq
## [1] 72 GTCTGCTGTATCTGTGTCGGCTGTCTCGCGG...CAATGAAGGCCTGGAATGTCACTACCCCCAG
## [2] 72 CTAGGGCAATCTTTGCAGCAATGAATGCCAA...GTGGCTTTTGAGGCCAGAGCAGACCTTCGGG
head(quality(myFastq), 2)## class: FastqQuality
## quality:
## A BStringSet instance of length 2
## width seq
## [1] 72 HHHHHHHHHHHHHHHHHHHHEBDBB?B:BBG...FEFBDBD@DDECEE3>:?;@@@>?=BAB?##
## [2] 72 IIIIHIIIGIIIIIIIHIIIIEGBGHIIIIH...IHIIIHIIIIIGIIIEGIIGBGE@DDGGGIG
head(id(myFastq), 2)## A BStringSet instance of length 2
## width seq
## [1] 53 ERR127302.8493430 HWI-EAS350_0441:1:34:16191:2123#0/1
## [2] 53 ERR127302.21406531 HWI-EAS350_0441:1:88:9330:2587#0/1
encoding(quality(myFastq))[seq(1,51,by=2)]## ! # % ' ) + - / 1 3 5 7 9 ; = ? A C E G I K M O Q
## 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48
## S
## 50
alphabet(sread(myFastq))[1:4]## [1] "A" "C" "G" "T"
Functions FastqSampler and FastqStreamer
Count the reads in a fastq file:
nr_myFastq <- 0
strm <- FastqStreamer(fq1_path,1000)
repeat {
## Get FASTQ chunk:
fq <- yield(strm)
if (length(fq) == 0)
break
## Do something on the chunk:
nr_myFastq <- nr_myFastq + length(fq)
}
close(strm) #close the connection
nr_myFastq## [1] 20000
fastqc from Babraham
Several tools available in R/BioC: ShortRead qa/qa2 functions, qrqc, seqTools, Rqc
Run library(Rqc) on a fastq file:
rqcResultSet <- rqcQA(fq1_path, sample=TRUE)rqcCycleQualityPlot(rqcResultSet)rqcCycleBaseCallsLinePlot(rqcResultSet)Define some filters:
max1N <- nFilter(threshold=1L) #No 'Ns' in the reads
goodq <- srFilter(function(x){apply(as(quality(x),"matrix"),
1,median,na.rm=T)>=30},
name="MedianQualityAbove30")
myFilter <- compose(max1N,goodq) #combine filtersCreate a function to filter and trim the reads:
FilterAndTrim <- function(fl,destination=sprintf("%s_filtered",fl))
{
stream <- FastqStreamer(fl) ## open input stream
on.exit(close(stream))
repeat {
###get fastq chunk
fq <- yield(stream)
if (length(fq)==0)
break
###TRIM first 4 and last 2 bases
fq <- narrow(fq,start=5,end=70)
###FILTER
fq <- fq[myFilter(fq)]
###write filtered fastq
writeFastq(fq, destination, mode="a")
}
}Apply the function:
FilterAndTrim(fqFiles[1],
destination=file.path(getwd(),"FilteredFastq.fq"))R packages: Rbowtie, Rbowtie2, QuasR (Gaidatzis et al. 2015), Rsubread (Liao, Smyth, and Shi 2013; Liao, Smyth, and Shi 2014)
Mapping quality scores (MAQ aligner in Li, Ruan, and Durbin 2008):
- base qualities (Phred scores)
- mismatches/indels
- repetitions
- paired reads
See also Wikipedia’s list of alignment tools
library(Rsamtools);library(GenomicAlignments)
library(pasillaBamSubset)
sr <- untreated1_chr4() #single-end
pr <- untreated3_chr4() #paired-endUse samtools to index the file
indexBam(sr_bamFile)All functions from samtools are available with R (e.g. sortBam, countBam, filterBam, mergeBam, etc.)
Define what to import
which=GRanges(seqnames="chr4",
ranges=IRanges(c(75000,1190000),
c(85000,1203000)),
strand="*")
what = c("rname","strand","pos","qwidth","seq")
flag=scanBamFlag(isDuplicate=FALSE)
param=ScanBamParam(which=which,what=what,flag=flag)srbam <- readGAlignments(sr,param=param)
srbam[1:2]## GAlignments object with 2 alignments and 5 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] chr4 + 21M13615N50M55N4M 75 72990 86734
## [2] chr4 - 4M13615N50M55N21M 75 73007 86751
## width njunc | rname strand pos qwidth
## <integer> <integer> | <factor> <factor> <integer> <integer>
## [1] 13745 2 | chr4 + 72990 75
## [2] 13745 2 | chr4 - 73007 75
## seq
## <DNAStringSet>
## [1] AAAAACTGCA...CGTAGCCACA
## [2] ATACCTGTGA...TGGACGGCTG
## -------
## seqinfo: 8 sequences from an unspecified genome
prbam <- readGAlignmentPairs(pr)
prbam[1:2]## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns:
## seqnames strand : ranges -- ranges
## <Rle> <Rle> : <IRanges> -- <IRanges>
## [1] chr4 + : [169, 205] -- [ 326, 362]
## [2] chr4 + : [943, 979] -- [1086, 1122]
## -------
## seqinfo: 8 sequences from an unspecified genome
There is also a BamFile function to create a reference to a large BAM file without importing it (as for FaFile).
See also the GenomicFiles package for working on many, large files
Packages IRanges & GenomicRanges (Lawrence et al. 2013)
See also the Introduction by Martin Morgan, 2014
A simple IRanges:
eg = IRanges(start = c(1, 10, 20),
end = c(4, 10, 19),
names = c("A", "B", "C"))
eg## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## A 1 4 4
## B 10 10 1
## C 20 19 0
A bigger IRanges:
set.seed(123) #For reproducibility
start = floor(runif(10000, 1, 1000))
width = floor(runif(10000, 0, 100))
ir = IRanges(start, width=width)
ir## IRanges object with 10000 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 318 31
## [2] 788 819 32
## [3] 409 495 87
## [4] 883 914 32
## [5] 940 951 12
## ... ... ... ...
## [9996] 466 492 27
## [9997] 899 983 85
## [9998] 114 124 11
## [9999] 571 595 25
## [10000] 900 965 66
Vector-like behavior: length, [
Accessors: start, end, width, names
length(ir)## [1] 10000
width(ir[1:4])## [1] 31 32 87 32
names(eg)## [1] "A" "B" "C"
Some useful functions:
mid(ir[1:4])## [1] 303 803 452 898
successiveIRanges(width=rep(10,3),gap=10)## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 10 10
## [2] 21 30 10
## [3] 41 50 10
tile(ir[1:2],n=2)## IRangesList of length 2
## [[1]]
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 302 15
## [2] 303 318 16
##
## [[2]]
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 788 803 16
## [2] 804 819 16
irl <- split(ir,width(ir)) # an IRangesList
irl[[1]][1:3]## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 321 320 0
## [2] 600 599 0
## [3] 184 183 0
length(irl)## [1] 100
head(elementNROWS(irl))## 0 1 2 3 4 5
## 96 83 108 95 84 110
List = list with all elements of the same type. See ?List
start(irl)[1:2]## IntegerList of length 2
## [["0"]] 321 600 184 297 276 816 87 729 ... 858 52 85 188 308 289 936 669
## [["1"]] 915 576 706 235 678 647 451 138 ... 638 66 979 740 300 869 433 645
log(start(irl)[1:2])## NumericList of length 2
## [["0"]] 5.77144112313002 6.39692965521615 ... 6.50578406012823
## [["1"]] 6.81892406527552 6.35610766069589 ... 6.46925031679577
Package GenomicRanges:
library(GenomicRanges)A typical use case:
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
(gr <- exons(txdb))## GRanges object with 76920 ranges and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr2L [7529, 8116] + | 1
## [2] chr2L [8193, 8589] + | 2
## [3] chr2L [8193, 9484] + | 3
## ... ... ... ... . ...
## [76918] chrUextra [523024, 523048] - | 76918
## [76919] chrUextra [523024, 523086] - | 76919
## [76920] chrUextra [523060, 523086] - | 76920
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
Operations on GRanges are generally seqnames-aware and strand-aware (see argument ignore.strand)
(grl <- exonsBy(txdb,by="gene"))## GRangesList object of length 15682:
## $FBgn0000003
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr3R [2648220, 2648518] + | 45123 <NA>
##
## $FBgn0000008
## GRanges object with 13 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## [1] chr2R [18024494, 18024531] + | 20314 <NA>
## [2] chr2R [18024496, 18024713] + | 20315 <NA>
## [3] chr2R [18024938, 18025756] + | 20316 <NA>
## ... ... ... ... . ... ...
## [11] chr2R [18059821, 18059938] + | 20328 <NA>
## [12] chr2R [18060002, 18060339] + | 20329 <NA>
## [13] chr2R [18060002, 18060346] + | 20330 <NA>
##
## ...
## <15680 more elements>
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
See ?'intra-range-methods'
Also reflect, narrow and threebands, restrict and trim
See ?'inter-range-methods'
See ?'setops-methods'
See also ?'nearest-methods' including nearest, precede, follow and distance,
Q: Number of TSS located at >500bp from another gene?
Get all genes and transcrits:
Dmg <- genes(txdb)
Dmt <- transcriptsBy(txdb,by="gene")Get all TSS:
Dm_tss <- unlist(reduce(promoters(Dmt,up=0,down=1),min.gap=0L))Proportion of TSS overlapping with more than 1 gene +/- 500bp:
mean(countOverlaps(Dm_tss,Dmg+500) > 1) #!strand-aware## [1] 0.2509943
mean(countOverlaps(Dm_tss,Dmg+500,ignore.strand=T) > 1)## [1] 0.5167952
Obtaining the overlaps:
fov <- findOverlaps(Dm_tss,Dmg+500,ignore.strand=T) ; fov[1:3]## Hits object with 3 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 1
## [2] 1 1383
## [3] 2 2
## -------
## queryLength: 20869 / subjectLength: 15682
Two genes on opposite strands that are overlapping:
Dmg[subjectHits(fov)[queryHits(fov)==1]]## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## FBgn0000003 chr3R [2648220, 2648518] + | FBgn0000003
## FBgn0011904 chr3R [2648685, 2648757] - | FBgn0011904
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
Q: How many reads in srbam overlap with gene FBgn0002521?
length(subsetByOverlaps(srbam,Dmg["FBgn0002521"]))## [1] 358
Q: How many reads in srbam overlap with exons of FBgn0002521?
length(srbam[overlapsAny(srbam,grl[["FBgn0002521"]])])## [1] 346
Reads mapping on exons:
ctex <- summarizeOverlaps(features = grl[seqnames(Dmg)=="chr4"],
reads = srbam,
mode = Union)head(assays(ctex)$counts)## reads
## FBgn0002521 346
## FBgn0004607 0
## FBgn0004859 4
## FBgn0005558 0
## FBgn0005561 0
## FBgn0005666 0
(Huber et al. 2015)
srbam <- readGAlignments(sr)
(covr <- coverage(srbam))## RleList of length 8
## $chr2L
## integer-Rle of length 23011544 with 1 run
## Lengths: 23011544
## Values : 0
##
## $chr2R
## integer-Rle of length 21146708 with 1 run
## Lengths: 21146708
## Values : 0
##
## $chr3L
## integer-Rle of length 24543557 with 1 run
## Lengths: 24543557
## Values : 0
##
## $chr3R
## integer-Rle of length 27905053 with 1 run
## Lengths: 27905053
## Values : 0
##
## $chr4
## integer-Rle of length 1351857 with 122061 runs
## Lengths: 891 27 5 12 13 45 ... 3 106 75 1600 75 1659
## Values : 0 1 2 3 4 5 ... 6 0 1 0 1 0
##
## ...
## <3 more elements>
Genes on chromosome 4:
gn4 <- Dmg[seqnames(Dmg)=="chr4"]Extract gene level profiles:
profgn4 <- covr[gn4]
profgn4[strand(gn4)=="-"] <- lapply(profgn4[strand(gn4)=="-"],rev)
names(profgn4) <- names(gn4) ; profgn4[1:2]## RleList of length 2
## $FBgn0002521
## integer-Rle of length 9178 with 825 runs
## Lengths: 86 5 26 1 10 2 13 13 5 5 ... 29 1 1 1 1 1 11 3 4 13
## Values : 0 1 2 4 5 6 7 8 9 8 ... 8 7 6 7 8 7 6 5 6 5
##
## $FBgn0004607
## integer-Rle of length 37983 with 41 runs
## Lengths: 11762 75 1336 15 60 ... 94 75 1986 75 829
## Values : 0 1 0 1 2 ... 0 1 0 1 0
Extract the first 1Kb as a matrix:
profgn4 <- profgn4[elementNROWS(profgn4)>=1000]
profgn4 <- as(lapply(profgn4,window,1,1000),"RleList")
mat1kb <- matrix(as.numeric(unlist(profgn4, use.names=F)),
nrow=length(profgn4), byrow=T,
dimnames=list(names(profgn4),NULL))
mat1kb <- mat1kb[rowSums(mat1kb)>0,]Plot the average profile:
df1Kb <- data.frame(Coordinate=1:1000,
Coverage=apply(mat1kb,2,mean,na.rm=T,trim=0.03))
ggplot(df1Kb,aes(x=Coordinate,y=Coverage))+
geom_line()getSeq(Dmelanogaster,gn4[1:2])## A DNAStringSet instance of length 2
## width seq names
## [1] 9178 ATCGAATACCCATGCCAAACA...ATAAAAGTACGTTAACAGCA FBgn0002521
## [2] 37983 CAGCTCAGTCGAAAAAAAACG...AACGTACATTTATACGTCCT FBgn0004607
Views(Dmelanogaster,gn4[1:2])## BSgenomeViews object with 2 views and 1 metadata column:
## seqnames ranges strand dna
## <Rle> <IRanges> <Rle> <DNAStringSet>
## FBgn0002521 chr4 [1193094, 1202271] - [ATCGAATACC...GTTAACAGCA]
## FBgn0004607 chr4 [ 522436, 560418] + [CAGCTCAGTC...TATACGTCCT]
## | gene_id
## | <character>
## FBgn0002521 | FBgn0002521
## FBgn0004607 | FBgn0004607
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
select(org.Dm.eg.db,
keys=c('FBgn0015664','FBgn0015602'),keytype="FLYBASE",
columns=c('SYMBOL','UNIGENE','ENTREZID','FLYBASECG'))## FLYBASE SYMBOL UNIGENE ENTREZID FLYBASECG
## 1 FBgn0015664 Dref Dm.7169 34328 CG5838
## 2 FBgn0015602 BEAF-32 Dm.9114 36645 CG10159
library(AnnotationHub)
hub <- AnnotationHub()
length(hub) # >43500 datasets
unique(hub$dataprovider)
head(unique(hub$species))
head(unique(ah$rdataclass))
Thank you for your attention!
Brenner, S., M. Johnson, J. Bridgham, G. Golda, D. H. Lloyd, D. Johnson, S. Luo, et al. 2000. “Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays.” Nat. Biotechnol. 18 (6): 630–34.
Davis, S., and P. S. Meltzer. 2007. “GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 23 (14): 1846–7.
Dobin, A., C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T. R. Gingeras. 2013. “STAR: ultrafast universal RNA-seq aligner.” Bioinformatics 29 (1): 15–21.
Durinck, S., Y. Moreau, A. Kasprzyk, S. Davis, B. De Moor, A. Brazma, and W. Huber. 2005. “BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis.” Bioinformatics 21 (16): 3439–40.
Durinck, S., P. T. Spellman, E. Birney, and W. Huber. 2009. “Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt.” Nat Protoc 4 (8): 1184–91.
Gaidatzis, D., A. Lerch, F. Hahne, and M. B. Stadler. 2015. “QuasR: quantification and annotation of short reads in R.” Bioinformatics 31 (7): 1130–2.
Gentleman, Robert C, Vincent J. Carey, Douglas M. Bates, and others. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biology 5: R80. http://genomebiology.com/2004/5/10/R80.
Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nat. Methods 12 (2): 115–21.
Kim, D., B. Langmead, and S. L. Salzberg. 2015. “HISAT: a fast spliced aligner with low memory requirements.” Nat. Methods 12 (4): 357–60.
Kim, D., G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg. 2013. “TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.” Genome Biol. 14 (4): R36.
Langmead, B., and S. L. Salzberg. 2012. “Fast gapped-read alignment with Bowtie 2.” Nat. Methods 9 (4): 357–59.
Langmead, B., C. Trapnell, M. Pop, and S. L. Salzberg. 2009. “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.” Genome Biol. 10 (3): R25.
Lawrence, M., and M. Morgan. 2014. “Scalable Genomics with R and Bioconductor.” Statistical Science 29 (2): 214–26.
Lawrence, M., R. Gentleman, and V. Carey. 2009. “rtracklayer: an R package for interfacing with genome browsers.” Bioinformatics 25 (14): 1841–2.
Lawrence, M., W. Huber, H. Pages, P. Aboyoun, M. Carlson, R. Gentleman, M. T. Morgan, and V. J. Carey. 2013. “Software for computing and annotating genomic ranges.” PLoS Comput. Biol. 9 (8): e1003118.
Li, H., and R. Durbin. 2009. “Fast and accurate short read alignment with Burrows-Wheeler transform.” Bioinformatics 25 (14): 1754–60.
———. 2010. “Fast and accurate long-read alignment with Burrows-Wheeler transform.” Bioinformatics 26 (5): 589–95.
Li, H., J. Ruan, and R. Durbin. 2008. “Mapping short DNA sequencing reads and calling variants using mapping quality scores.” Genome Res. 18 (11): 1851–8.
Liao, Y., G. K. Smyth, and W. Shi. 2013. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.” Nucleic Acids Res. 41 (10): e108.
———. 2014. “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics 30 (7): 923–30.
Marco-Sola, S., M. Sammeth, R. Guigo, and P. Ribeca. 2012. “The GEM mapper: fast, accurate and versatile alignment by filtration.” Nat. Methods 9 (12): 1185–8.
Morgan, M., S. Anders, M. Lawrence, P. Aboyoun, H. Pages, and R. Gentleman. 2009. “ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data.” Bioinformatics 25 (19): 2607–8.
Pollard, M. O., D. Gurdasani, A. J. Mentzer, T. Porter, and M. S. Sandhu. 2018. “Long reads: their purpose and place.” Hum. Mol. Genet. 27 (R2): R234–R241.
Pyl, P. T., J. Gehring, B. Fischer, and W. Huber. 2014. “h5vc: scalable nucleotide tallies with HDF5.” Bioinformatics 30 (10): 1464–6.
Tippmann, S. 2015. “Programming tools: Adventures with R.” Nature 517 (7532): 109–10.
Trapnell, C., L. Pachter, and S. L. Salzberg. 2009. “TopHat: discovering splice junctions with RNA-Seq.” Bioinformatics 25 (9): 1105–11.
Wang, K., D. Singh, Z. Zeng, S. J. Coleman, Y. Huang, G. L. Savich, X. He, et al. 2010. “MapSplice: accurate mapping of RNA-seq reads for splice junction discovery.” Nucleic Acids Res. 38 (18): e178.
Zhu, Y., R. M. Stephens, P. S. Meltzer, and S. R. Davis. 2013. “SRAdb: query and use public next-generation sequencing data from within R.” BMC Bioinformatics 14 (January): 19.